The exam consists of 8 parts in which you are asked to conduct analysis of different datasets. Each part is focused on a different dataset. The datasets are included in different R packages and you need to install the packages to access the data. Your analysis should be done using R and your answers should be given in R. For example, if the question is
Your solution should be
x<-rnorm(100,0,1)
hist(x)
You do not need to explain your R code. For example, you do not need to write: “the function hist() was used to produce the histogram.” Your answers to the questions should be the R code that you used to produce the output.
You need to submit the following materials:
You do not need to interpret the results! For example, if the question is to fit a One-Way ANOVA model, you do not need to formulate the model or interpret the results. This means, for example, that you do not need to write “the p-value is 0.007, indicating a significant effect of the factor.”
You will need to upload your solution to BB. You will receive information about the submission by email.
The second part of the exam will take place (online) on 15/01/2024 from 08:30 to 11:30.
The oral exam will take place on 16/01/24 and 17/01/24. The schedule is available online in BB.
In this part of the exam, the questions are focused on the real_data_GDI dataset which is a part of the genderstat R package. To access the data you need to install the package. More information can be found on https://cran.r-project.org/web/packages/genderstat/index.html. Use the code below to access the data.
library(genderstat)
data("real_data_GDI")
names(real_data_GDI)
## [1] "country" "female_life_expectancy" "male_life_expectancy"
## [4] "female_mean_schooling" "male_mean_schooling" "female_gni_per_capita"
## [7] "male_gni_per_capita"
How many countries are included in the data? Count missing values in each variable in the data.
Create a new data frame without the missing data. How many countries are left in the data?
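As a hedged illustration of the missing-value steps in the two questions above, the sketch below uses a small made-up data frame. The object `toy` and all of its values are hypothetical and are not the exam data; only base R functions are used.

```r
# Hypothetical toy data; not the real_data_GDI dataset
toy <- data.frame(
  country = c("A", "B", "C", "D"),
  female_life_expectancy = c(81.2, NA, 74.5, 59.9),
  male_life_expectancy = c(77.0, 70.1, NA, 55.3)
)

n_rows <- nrow(toy)                 # number of rows (one per country)
na_per_var <- colSums(is.na(toy))   # number of missing values per variable
toy_complete <- na.omit(toy)        # keep only rows without any NA
n_complete <- nrow(toy_complete)    # rows left after removal

print(na_per_var)
print(n_complete)
```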
Calculate the minimum and maximum of the variables life expectancy, mean schooling, and GNI per capita for both males and females.
For each gender, sort the life expectancy of all countries from the highest to the lowest, and print the top country.
For each gender, print the 15 countries with the highest life expectancy.
How many countries have both female and male life expectancy higher than 80?
Show the countries listed in question Q1.6.
In this question, we use the dataset that was created in Q1.2 (the dataset without the missing values).
Define a new categorical variable flife_cat in the following way: Re-code the variable female_life_expectancy into three categories:
female_life_expectancy <60: Low.
female_life_expectancy 60-80: Medium.
female_life_expectancy >80: High.
Count how many countries are included in each category.
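One possible way to express the three-level recoding is sketched below on a made-up vector. The values in `x` are hypothetical, and treating both 60 and 80 as Medium is an assumption about how the boundaries of "60-80" are meant.

```r
# Hypothetical values chosen to include the boundary cases 60 and 80
x <- c(55.2, 60, 71.4, 80, 83.9)

# Assumed rule: < 60 is Low, 60 to 80 (inclusive) is Medium, > 80 is High
x_cat <- ifelse(x < 60, "Low", ifelse(x <= 80, "Medium", "High"))
x_cat <- factor(x_cat, levels = c("Low", "Medium", "High"))

counts <- table(x_cat)   # number of values in each category
print(counts)
```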
Figure 2.1
Define a new dataset that includes the countries in which female life expectancy is classified as Low. Sort the data by male life expectancy and print the top 3 countries.
For the dataset in Q2.3, calculate the mean and standard deviation of male life expectancy and produce the output below.
## mean_expectancy std_expectancy
## 1 52.65 2.562616
In this question we use the real_data_GDI dataset without the missing values.
Create a new data frame for countries where male life expectancy is higher than 53. How many countries are included in the new dataset?
For the new dataset, calculate a 95% confidence interval for the female life expectancy using a standard normal distribution. Note that you need to program the formula for the confidence interval by yourself.
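The formula being asked for is the usual large-sample interval, mean ± z × sd / sqrt(n), with z the 0.975 quantile of the standard normal distribution. A minimal sketch on a made-up vector follows; `x` and its values are hypothetical.

```r
# Hypothetical sample; not the exam data
x <- c(61.3, 58.7, 70.2, 66.5, 73.1, 64.8, 69.0, 62.4)

n <- length(x)
x_bar <- mean(x)
se <- sd(x) / sqrt(n)   # estimated standard error of the mean
z <- qnorm(0.975)       # standard normal quantile for 95% coverage

ci <- c(lower = x_bar - z * se, upper = x_bar + z * se)
print(ci)
```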
Write a function that receives a numerical vector and returns, as numerical output, the 95% confidence interval and the mean of the vector. Inside your function, use the R function t.test() to calculate the confidence interval and the mean. Apply this function to female life expectancy in the new data defined in Q3.1.
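A function of this kind could look roughly like the sketch below. The name `mean_ci`, the `conf.level` argument, and the example vector are illustrative assumptions; the only library call relied on is `t.test()`, whose result carries the `estimate` and `conf.int` components.

```r
# mean_ci() is a hypothetical name used for illustration
mean_ci <- function(x, conf.level = 0.95) {
  tt <- t.test(x, conf.level = conf.level)   # one-sample t-test
  c(mean = unname(tt$estimate),              # sample mean
    lower = tt$conf.int[1],                  # lower confidence limit
    upper = tt$conf.int[2])                  # upper confidence limit
}

# Hypothetical sample; not the exam data
x <- c(61.3, 58.7, 70.2, 66.5, 73.1, 64.8, 69.0, 62.4)
res <- mean_ci(x)
print(res)
```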
Use the R package interpretCI (and the meanCI() function) to calculate the confidence interval for the female life expectancy using a standard normal distribution in the new dataset defined in Q3.1.
In this question, we use the real_data_GDI dataset without the missing values.
Figure 4.1
Figure 4.2
Calculate the correlation between the variables female_mean_schooling and female_life_expectancy using the R function cor.test.
Fit a linear regression model that includes the mean schooling for females as the predictor and the life expectancy for females as the dependent variable. Print only the coefficients panel (coefficients, standard errors, t values and p values).
Produce a scatter plot of the female_mean_schooling vs female_life_expectancy, and add a regression line as shown in Figure 4.3.
Figure 4.3
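For the regression steps above, the sketch below shows one way to print only the coefficient panel and to overlay a fitted line on a scatter plot. It uses the built-in `cars` dataset as a stand-in and base graphics; the original figure may well have been drawn with a different plotting package.

```r
# cars is a built-in R dataset, used here only as a stand-in
fit <- lm(dist ~ speed, data = cars)

# Coefficient panel: estimate, standard error, t value, p value
coefs <- coef(summary(fit))
print(coefs)

# Scatter plot with the fitted regression line added
plot(cars$speed, cars$dist, xlab = "speed", ylab = "dist")
abline(fit, col = "red")
```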
For the analysis of this part we use the flying data which is a part of the R package dropout. This is a modified version of the Flying Etiquette Survey data. More information can be found in https://CRAN.R-project.org/package=dropout. The code below can be used to access the data
library(dropout)
data("flying")
names(flying)
## [1] "respondent_id" "travel_frequency"
## [3] "seat_recline" "height"
## [5] "children_under_18" "two_armrests"
## [7] "middle_armrest" "window_shade"
## [9] "moving_to_unsold_seat" "talking_to_seatmate"
## [11] "getting_up_on_6_hour_flight" "obligation_to_reclined_seat"
## [13] "recline_seat_rudeness" "eliminate_reclining_seats"
## [15] "switch_for_friends" "switch_for_family"
## [17] "wake_passenger_bathroom" "wake_passenger_walk"
## [19] "baby_on_plane" "unruly_children"
## [21] "electronics_violation" "smoking_violation"
## [23] "gender" "age"
## [25] "household_income" "education"
## [27] "location_census_region" "survey_type"
Remove the missing values from the data. How many observations remain in the data?
For the rest of question 5 we use the flying data without the missing values. Produce the data frame shown below, which shows the number of respondents for each age and gender category.
## age gender n
## 1 18-29 Female 75
## 2 18-29 Male 62
## 3 30-44 Female 78
## 4 30-44 Male 95
## 5 45-60 Female 95
## 6 45-60 Male 108
## 7 > 60 Female 87
## 8 > 60 Male 77
Figure 5.1
Figure 5.2
Figure 5.3
In this question, we use the flying data without the missing values.
## gender baby_on_plane n percentage
## 1 Female No, not at all rude 255 76.119403
## 2 Female Yes, somewhat rude 58 17.313433
## 3 Female Yes, very rude 22 6.567164
## 4 Male No, not at all rude 214 62.573099
## 5 Male Yes, somewhat rude 89 26.023392
## 6 Male Yes, very rude 39 11.403509
Figure 6.1
Count the distribution of the respondents’ answers (from each gender and age group) to the question “is it rude to bring a baby on a plane?”.
Produce the plot shown in Figure 6.2.
Figure 6.2
Figure 6.3
In this question we focus on the flying data without missing values.
## baby_on_plane
## gender No, not at all rude Yes, somewhat rude Yes, very rude
## Female 255 58 22
## Male 214 89 39
Use a chi-square test to test the hypothesis that gender and baby_on_plane are independent.
Define an R object for the test statistic, plot the density plot of the test statistic under the null hypothesis and add the line for the observed test statistic.
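A rough sketch of the test and the null-density plot is given below. The counts are copied from the gender by baby_on_plane table printed above, so the `dropout` package is not needed; the object names (`tab`, `chi_obs`, `dof`) and the plotting range are illustrative choices.

```r
# Counts copied from the gender by baby_on_plane table above
tab <- matrix(c(255, 58, 22,
                214, 89, 39),
              nrow = 2, byrow = TRUE,
              dimnames = list(gender = c("Female", "Male"),
                              baby_on_plane = c("No, not at all rude",
                                                "Yes, somewhat rude",
                                                "Yes, very rude")))

test <- chisq.test(tab)
chi_obs <- unname(test$statistic)   # observed test statistic
dof <- unname(test$parameter)       # degrees of freedom: (2 - 1) * (3 - 1)

# Chi-square density under the null, with the observed statistic marked
curve(dchisq(x, df = dof), from = 0, to = max(20, chi_obs + 5),
      xlab = "test statistic", ylab = "density")
abline(v = chi_obs, col = "red", lty = 2)
```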
Prepare a presentation of 5-10 slides using R markdown about the connection between the gender and the variable baby_on_plane. Make sure that your presentation includes:
Please note that you WILL NOT be asked to give the presentation and you WILL NOT be asked questions about the presentation. Your aim in this question is to demonstrate that you know how to use R markdown to make a presentation about your analysis. More details on how to make a presentation using R markdown: https://rmarkdown.rstudio.com/lesson-11.html.
In this part of the exam, the questions are focused on the unemp dataset which is a part of the viridis R package. To access the data you need to install the package. More information can be found in https://cran.r-project.org/web/packages/viridis/viridis.pdf. Use the code below to access the data.
library(viridis)
data(unemp)
names(unemp)
## [1] "id" "state_fips" "county_fips" "name" "year"
## [6] "rate" "county" "state"
For the unemp dataset,
How many observations are included in the dataset? How many states are included in this dataset?
How many counties are there in NY?
Create a new data frame named unemp_NY for NY state. Produce the following output for the variable rate:
## min_rate_NY max_rate_NY mean_rate_NY
## 1 5.6 13.3 8.009677
Create a new data frame named sub_unemp, which includes data of 3 states: GA, TX and VA.
How many observations are included in the new data frame?
Produce Figure 10.1 presented below.
Figure 10.1
Save Figure 10.1, produced in Q10.2, as a png file and include it in the zip file of your solution.
Conduct a t-test to test the hypothesis that the unemployment rate in states TX and VA is equal against a two-sided alternative. What is the value of the test statistic? How many observations were included in the analysis?
Create a new R object that contains the upper and lower limits of the 95% confidence interval for the mean difference. DO NOT use xxx<-c(-0.2592,0.6928).
Test if the variance of the unemployment rate in the two states is equal.
If needed, adjust your analysis in Q10.4 according to the result obtained in Q10.6.
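The sequence of steps above (two-sample t-test, extracting the confidence limits from the test object, checking the variances, and adjusting the test) can be sketched as follows. The two samples are simulated and hypothetical, and the 0.05 threshold for the variance test is an assumption.

```r
set.seed(1)  # simulated rates; hypothetical, not the unemp data
rate_a <- rnorm(40, mean = 8, sd = 2)
rate_b <- rnorm(35, mean = 7, sd = 3)

tt <- t.test(rate_a, rate_b)          # two-sided Welch test (R's default)
t_stat <- unname(tt$statistic)        # value of the test statistic
ci_diff <- as.numeric(tt$conf.int)    # lower and upper 95% limits, not typed by hand

vt <- var.test(rate_a, rate_b)        # F test for equality of variances
# Pool the variances only if the F test does not reject equality (assumed 5% level)
tt_adj <- t.test(rate_a, rate_b, var.equal = vt$p.value >= 0.05)

print(t_stat)
print(ci_diff)
print(vt$p.value)
```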
In this part, the questions are focused on the pigs dataset which is a part of the emmeans R package. To access the data you need to install the package. More information can be found by help(pigs). You can use the code below to access the data.
library(emmeans)
data(pigs)
names(pigs)
## [1] "source" "percent" "conc"
In this question, we use the pigs dataset without the missing values.
Figure 11.1
For the new data, compute summary statistics (count, mean, sd) of the concentration of free plasma leucine (the variable conc) by the variable percent_class.
Use the function aov() to fit a one-way ANOVA model in which the concentration of free plasma leucine (the variable conc) is the dependent variable and the protein percentage in the diet (percent_class) is the factor.
Print the ANOVA table for the model.
Create a new R object, F.value, that contains the value of the F test statistic. DO NOT use F.value=1.858.
Produce the diagnostic plots (qq normal plot for residuals and histogram for residuals) presented in Figure 11.2 and 11.3 below.
Figure 11.2
Figure 11.3
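The ANOVA steps above can be sketched on a stand-in dataset. The example below uses the built-in `PlantGrowth` data rather than `pigs`, so the numbers will differ from the exam output; it shows one way to pull the F statistic out of the ANOVA table and to draw the two residual diagnostics with base graphics.

```r
# PlantGrowth is a built-in dataset used as a stand-in for pigs
fit <- aov(weight ~ group, data = PlantGrowth)
anova_tab <- summary(fit)[[1]]    # ANOVA table as a data frame
print(anova_tab)

# Extract the F statistic from the table instead of typing it by hand
F.value <- anova_tab[["F value"]][1]

# Residual diagnostics: normal Q-Q plot and histogram
res <- residuals(fit)
qqnorm(res)
qqline(res)
hist(res, main = "Histogram of residuals", xlab = "residuals")
```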
In this part we use the data fish which is a part of the rrcov R package. To access the data you need to install the package. More information can be found in https://search.r-project.org/CRAN/refmans/rrcov/html/fish.html. You can use the code below to access the data.
library(rrcov)
data(fish)
names(fish)
## [1] "Weight" "Length1" "Length2" "Length3" "Height" "Width" "Species"
In this question we use the fish dataset WITH the missing values.
Produce a frequency table of the number of fish in each species.
Observation 14 has a missing value in variable Weight. Remove this observation from the data and create a new dataset, fish2. Use the new dataset to create a bar plot for the weight by species as shown in Figure 12.1.
Figure 12.1
Figure 12.2
For the new dataset defined in Q12.2.
Use a for loop to calculate the correlation between Weight and Width for each species. This implies that for each step in the for loop another species will be selected and the correlation between Weight and Width will be calculated and printed.
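The loop pattern described above might look like the sketch below, which uses the built-in `iris` data as a stand-in for fish2. The variable names are illustrative; with an integer-coded grouping variable such as Species in fish, `unique()` or `sort(unique())` would take the place of `levels()`.

```r
# iris is a built-in dataset used as a stand-in for fish2
cors <- numeric(0)
for (sp in levels(iris$Species)) {
  sub <- iris[iris$Species == sp, ]                   # rows of one species
  cors[sp] <- cor(sub$Sepal.Length, sub$Sepal.Width)  # correlation within it
  cat(sp, ":", round(cors[sp], 3), "\n")              # print at each step
}
```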
Produce the following output WITHOUT using a for loop. Note that the variable Correlation is the correlation between Weight and Width.
## # A tibble: 7 × 2
## Species Correlation
## <int> <dbl>
## 1 1 0.344
## 2 2 0.216
## 3 3 0.449
## 4 4 0.638
## 5 5 0.648
## 6 6 0.0430
## 7 7 0.563
## # A tibble: 7 × 4
## Species avg_w avg_L1 avg_h
## <int> <dbl> <dbl> <dbl>
## 1 1 626 30.3 39.6
## 2 2 531 28.8 29.2
## 3 3 152. 20.6 26.7
## 4 4 155. 18.7 39.3
## 5 5 11.2 11.3 16.9
## 6 6 719. 42.5 15.8
## 7 7 382. 25.7 26.3
In this part we use the data msleep which is a part of the ggplot2 R package. To access the data you need to install the package. More information can be found in https://github.com/tidyverse/ggplot2/blob/main/data-raw/msleep.csv. You can use the code below to access the data.
library(dplyr)
data("msleep", package = "ggplot2")
head(msleep, 5)
## # A tibble: 5 × 11
## name genus vore order conservation sleep_total sleep_rem sleep_cycle awake
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Cheetah Acin… carni Carn… lc 12.1 NA NA 11.9
## 2 Owl mo… Aotus omni Prim… <NA> 17 1.8 NA 7
## 3 Mounta… Aplo… herbi Rode… nt 14.4 2.4 NA 9.6
## 4 Greate… Blar… omni Sori… lc 14.9 2.3 0.133 9.1
## 5 Cow Bos herbi Arti… domesticated 4 0.7 0.667 20
## # ℹ 2 more variables: brainwt <dbl>, bodywt <dbl>
How many observations and variables are included in the data?
Create a summary table of average sleep time (the variable sleep_total) for each level of the variable order, sorted in descending order of average sleep time.
Figure 16.1
Figure 16.2
Figure 16.3
In this part we use the ChickWeight data which is a part of the R datasets package. The datasets package is included with R, so no separate installation is needed. More information can be found in https://www.rdocumentation.org/packages/datasets/versions/3.6.2/topics/ChickWeight. You can use the code below to access the data.
data(ChickWeight)
head(ChickWeight)
## Grouped Data: weight ~ Time | Chick
## weight Time Chick Diet
## 1 42 0 1 1
## 2 51 2 1 1
## 3 59 4 1 1
## 4 64 6 1 1
## 5 76 8 1 1
## 6 93 10 1 1
Write a function that receives a dataset and a variable as input and returns the mean, median, and standard deviation of the variable, rounded to 2 decimal places. Apply this function to the ChickWeight data and the variable weight.
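One possible shape for such a function is sketched below. The name `my_summary` and the choice to pass the variable as a column name in quotes are assumptions; the question does not say how the variable should be supplied.

```r
# my_summary() is a hypothetical name; var is a column name given as a string
my_summary <- function(data, var) {
  x <- data[[var]]                                   # extract the column
  round(c(mean = mean(x), median = median(x), sd = sd(x)), 2)
}

res <- my_summary(ChickWeight, "weight")
print(res)
```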
In the output below, both the numerical and the graphical output were produced using the user-defined function my.analysis(). The function receives as input: (1) a dataset name, (2) the column number of variable 1 (a numerical variable), and (3) the column number of variable 2 (a factor). Note that both variable 1 and variable 2 are part of the dataset. For the analysis in this question we use the ChickWeight dataset at time 0 (so only observations at time 0 are included). The output was produced using the following code:
my.analysis(ChickWeight0,1,4)
The dataset ChickWeight0 contains the observations that were measured at time 0. Based on the output below, your task in this question is to write the function my.analysis() and to produce the output using the code above. Note that your function should produce identical output.
## $`Summary statistics`
## Diet Mean SD n
## 1 1 41.4 0.9947229 20
## 2 2 40.7 1.4944341 10
## 3 3 40.8 1.0327956 10
## 4 4 41.0 1.0540926 10
##
## $`ANOVA table`
## Df Sum Sq Mean Sq F value Pr(>F)
## dataset[, var2] 3 4.32 1.440 1.132 0.346
## Residuals 46 58.50 1.272
##
## $`Sample mean and 95% CI`
##
## $`Plot of the residuals`
Figure 18.1
Figure 18.2
In this part we use the data quakes which is a part of the R datasets collection. Use help() to get more information about the data. More information can be found in https://www.rdocumentation.org/packages/datasets/versions/3.6.2/topics/quakes. You can use the code below to access the data.
library(datasets)
data("quakes")
head(quakes)
## lat long depth mag stations
## 1 -20.42 181.62 562 4.8 41
## 2 -20.62 181.03 650 4.2 15
## 3 -26.00 184.10 42 5.4 43
## 4 -17.97 181.66 626 4.1 19
## 5 -20.42 181.96 649 4.0 11
## 6 -19.68 184.31 195 4.0 12
Figure 19.1
… latitude (lat), longitude (long), and depth (depth) of earthquakes in the quakes dataset. DO NOT include this figure in the PDF document for your answers but ONLY in the HTML document.
Figure 19.2
Calculate the mean Richter Magnitude (the variable mag) by the station (the variable stations).
Create a new dataset which contains the observations from the top 25 stations with the highest Richter Magnitude. How many observations are included?
For the new data, define a new variable that is equal to the ratio between Richter Magnitude and the depth, i.e., \[\text{ratio}=\frac{\text{Richter Magnitude}}{\text{depth}}.\]
Sort the data according to the variable ratio.
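Defining the ratio and sorting by it can be sketched in base R as below. For simplicity the sketch works on the full built-in `quakes` data rather than the top-25 subset from the earlier question, and sorting from the highest to the lowest ratio is an assumption, since the direction is not stated.

```r
# quakes ships with R; the full dataset is used here, not the top-25 subset
q <- quakes
q$ratio <- q$mag / q$depth                           # new ratio variable
q_sorted <- q[order(q$ratio, decreasing = TRUE), ]   # assumed: highest ratio first
print(head(q_sorted, 3))
```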
Print the three stations with the highest mean ratio.
Create a new dataset for the stations with ratio higher than 0.099. For the new data, produce Figure 20.1.
Figure 20.1
Figure 20.2